ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
Author Information
A paper from Jun Wang's group at the University of Central Florida.
Link: [2403.17312] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
Abstract
The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to dominate a broad range of NLP tasks. Despite their superior accuracy, LLMs present unique challenges in practical inference, concerning the compute and memory-intensive nature. Thanks to the autoregressive characteristic of LLM inference, KV caching for the attention layers in Transformers can effectively accelerate LLM inference by substituting quadratic-complexity computation with linear-complexity memory accesses. Yet, this approach requires increasing memory as demand grows for processing longer sequences. The overhead leads to reduced throughput due to I/O bottlenecks and even out-of-memory errors, particularly on resource-constrained systems like a single commodity GPU. In this paper, we propose ALISA, a novel algorithm-system co-design solution to address the challenges imposed by KV caching.【Sparse attention computation】 On the algorithm level, ALISA prioritizes tokens that are most important in generating a new token via a Sparse Window Attention (SWA) algorithm. SWA introduces high sparsity in attention layers and reduces the memory footprint of KV caching at negligible accuracy loss. On the system level, ALISA employs three-phase token-level dynamical scheduling【three-phase scheduling】 and optimizes the trade-off between caching and recomputation【caching vs. recomputation】, thus maximizing the overall performance in resource-constrained systems【resource-constrained systems】. In a single GPU-CPU system, we demonstrate that under varying workloads, ALISA improves the throughput of baseline systems such as FlexGen and vLLM by up to 3X and 1.9X, respectively.
One-Sentence Summary
ALISA is an algorithm-system co-design: a Sparse Window Attention (SWA) algorithm sparsifies attention to shrink the KV cache at negligible accuracy loss, and a three-phase token-level scheduler balances caching against recomputation, improving single-GPU throughput over FlexGen and vLLM by up to 3X and 1.9X.
Motivation
Keeping the KV cache on the GPU makes attention computation fast but creates memory pressure; offloading the KV cache to the CPU relieves memory pressure but adds memory-access (transfer) overhead. When the workload is small, keeping everything on the GPU is clearly efficient; when the workload is large, keeping the entire KV cache on the GPU can cause out-of-memory errors, while putting everything on the CPU introduces extra communication overhead. This work is mainly about how to resolve this trade-off.
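To make the memory side of this trade-off concrete, here is a back-of-the-envelope KV cache size calculation. This is a minimal sketch of my own; the OPT-13B-like shape and FP16 storage are illustrative assumptions, not numbers taken from the paper.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer, token, and batch element."""
    # Two tensors (K and V), each of shape [batch, seq_len, hidden_size], per layer.
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_elem

# Assumed OPT-13B-like shape (40 layers, hidden size 5120), 4K context, batch 8, FP16.
size = kv_cache_bytes(num_layers=40, hidden_size=5120, seq_len=4096, batch_size=8)
print(f"KV cache: {size / 2**30:.1f} GiB")   # ~25 GiB, beyond a 24 GiB commodity GPU
```

At this scale the KV cache alone no longer fits on a single commodity GPU, which is the regime ALISA targets.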
Analysis of the sparsity of attention weights (a measurement sketch follows the list below)
- Attention weights are highly sparse
- The larger the model, the higher the sparsity
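A minimal sketch (my own, not the paper's code) of how attention-weight sparsity can be measured: compute the causal softmax attention for one head and count the fraction of weights that fall below a small threshold.

```python
import numpy as np

def attention_sparsity(q, k, threshold=0.01):
    """Fraction of causal softmax attention weights below `threshold` (single head).

    q, k: [seq_len, head_dim] query / key matrices.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])                        # [seq_len, seq_len]
    masked = scores + np.triu(np.full_like(scores, -np.inf), k=1)  # hide future tokens
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                 # row-wise softmax
    valid = np.tril(np.ones_like(weights, dtype=bool))             # causal positions only
    return float((weights[valid] < threshold).mean())

# Toy example with random vectors; real LLM attention is far more skewed,
# which is exactly the sparsity the paper exploits.
rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, 512, 128))
print(f"sparsity below 0.01: {attention_sparsity(q, k):.1%}")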
Contributions
- Identifying the important tokens in attention: SWA (Sparse Window Attention)
- KV cache management: a three-phase scheduler
- Scheduling between caching and recomputation
Detailed Design
- SWA (Sparse Window Attention), compared in the paper's figure against other attention schemes:
- a: vanilla dense attention
- b: Longformer
- c: Sparse Transformer
- d: ALISA
- ρ: Spearman rank correlation coefficient
SWA adopts a "locally static + globally dynamic" idea.
The static part is the most recent two tokens.
The globally dynamic part is drawn from the tokens with the highest attention scores.
- The token budget is split into two parts, one global and one local (see the selection sketch after this block).
If not all attention weights (AW) are computed, how is the importance score S obtained? Or are tokens with no recorded information simply treated as 0?
The local attention sum is also reduced along the head dimension (not shown for conciseness).
- Is only layer 0 computed in full, with the needed KV cache entries then read in?
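A hedged sketch of how the "locally static + globally dynamic" budget split could look in code. Assumptions of mine: the importance score is simply the newest query's attention weights summed over heads (one reading of the quoted sentence about reducing along the head dimension), and `budget` / `local_window` are hypothetical parameter names; ALISA's exact scoring may differ.

```python
import numpy as np

def swa_select(attn_weights, budget, local_window=2):
    """Pick which cached tokens to keep for the next decoding step.

    attn_weights: [num_heads, seq_len] attention of the newest query over all
                  cached tokens.  budget: total number of KV entries to keep.
    Returns the sorted indices of the kept tokens (local + globally important).
    """
    seq_len = attn_weights.shape[-1]
    if seq_len <= budget:
        return np.arange(seq_len)              # cache still fits, keep everything

    # Locally static part: always keep the most recent `local_window` tokens.
    local = np.arange(seq_len - local_window, seq_len)

    # Globally dynamic part: rank the remaining tokens by their attention
    # weights reduced (summed) along the head dimension.
    importance = attn_weights.sum(axis=0)
    candidates = np.arange(seq_len - local_window)
    order = candidates[np.argsort(importance[candidates])[::-1]]
    global_top = order[: budget - local_window]

    return np.sort(np.concatenate([global_top, local]))

# Toy usage: 8 heads, 100 cached tokens, keep only 16 of them.
rng = np.random.default_rng(1)
keep = swa_select(rng.random((8, 100)), budget=16, local_window=2)
print(keep)
```

While the cache is still within budget everything is kept, so the sparsification only kicks in once the sequence grows long.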
- KV cache management
Three phases:
- Phase I: the entire KV cache resides on the GPU
- Phase II: as the sequence grows and the KV cache no longer fits on the GPU, part of it is offloaded to the CPU
- Phase III: as the sequence grows further, some tokens are deleted from the CPU side and recomputed on the GPU when needed
Specifically:
- In Phases II and III, the local KV cache stays on the GPU while the earlier tokens' KV cache is stored on the CPU
- Because the heuristic caching mechanism is dynamic, Phase III deletes the oldest KV cache entries on the CPU and recomputes them on the GPU when they are needed
- A load-versus-compute algorithm is introduced to minimize the total runtime (a cost-model sketch follows below)
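A rough sketch of the kind of load-versus-recompute comparison such an algorithm has to make. This is my own simplified cost model (it only counts the K/V projection matmuls and uses illustrative PCIe bandwidth and GPU throughput numbers), not the paper's formulation.

```python
def cheaper_to_recompute(num_tokens, hidden_size, num_layers,
                         pcie_gbps=16.0, gpu_tflops=150.0, bytes_per_elem=2):
    """Rough comparison: load offloaded KV from CPU memory vs. recompute it on GPU.

    Returns True when recomputing the K/V tensors is estimated to be faster
    than transferring them over PCIe.  All constants are illustrative.
    """
    # Transfer cost: K and V for every layer and token, moved across PCIe.
    kv_bytes = 2 * num_layers * hidden_size * num_tokens * bytes_per_elem
    load_time = kv_bytes / (pcie_gbps * 1e9)

    # Recompute cost (simplified): only the K and V projection matmuls,
    # ~2 (K and V) * 2 * hidden^2 FLOPs per token per layer.
    flops = 4 * num_layers * hidden_size ** 2 * num_tokens
    recompute_time = flops / (gpu_tflops * 1e12)

    return recompute_time < load_time

# Example: 1024 evicted tokens of an OPT-13B-like model (40 layers, hidden 5120).
print(cheaper_to_recompute(num_tokens=1024, hidden_size=5120, num_layers=40))
```

With these illustrative numbers recomputation comes out slightly cheaper than a PCIe transfer, which is the kind of regime where Phase III's delete-and-recompute policy pays off.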
Quantization